Week 8 - Support Vector Machines¶

Dr. David Elliott¶

1.6. Support Vector Classifier (SVC)

1.7. Support Vector Machine (SVM)

1.8. Exercises

TODO

  • have a look through http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf to see what you can add

We can't always perfectly separate the data with a $(K-1)$-dimensional hyperplane. To overcome this problem we could either:

  • tweak the constraints on the hyperplane to allow some points to be misclassified (soft margin),
  • transform the data to be separable by a hyperplane in another space (kernel method).

1.6. Support Vector Classifier (SVC) ¶

SVCs are a generalisation and extension of the maximal margin classifier, so they can be applied to a broader range of cases1.

In practice they are more robust to individual observations and better classify most training observations than the maximal margin classifier. This is because they take the approach that it is better to misclassify a few training examples in order to do a better job classifying the rest.

This is called a soft margin, as it allows a small subset of training observations to violate the margin, and even to be on the wrong side of the hyperplane.

Notes

  • "Developed in the computer science community in the 1990s"2

_"To extend SVM to cases in which the data is not linearly separable, we introduce the hinge loss function: $\max (0, 1 − y_i(\mathbf{w}\mathbf{x}_i − b))$. The hinge loss function is zero if the constraints in eq. 9 are satisfied; in other words, if wxi lies on the correct side of the decision boundary. For data on the wrong side of the decision boundary, the function’s value is proportional to the distance from the decision boundary."_8

We want to relax the constraints

$\mathbf{x}_i \cdot \mathbf{w} + b \geq +1 \quad \text{for} \ y_i = +1$,

$\mathbf{x}_i \cdot \mathbf{w} + b \leq -1 \quad \text{for} \ y_i = -1$,

when necessary.

This can be done by first introducing positive slack variables $\xi_i, i = 1, \ldots , l$ into the constraints5, which then become6:

$\mathbf{x}_i \cdot \mathbf{w} + b \geq +1 - \xi_i \quad \text{for} \ y_i = +1$,

$\mathbf{x}_i \cdot \mathbf{w} + b \leq -1 + \xi_i \quad \text{for} \ y_i = -1$,

$\xi_i \geq 0 \quad \forall_i$.

Notes

  • $\sum_i\xi_i$ is an upper bound on the number of training errors
  • $\xi_i$ tells us where the $i$th observation is located relative to the hyperplane; $\xi_i = 0$ being on the correct side of the margin, $\xi_i > 0$ being on the wrong side of the margin, and $\xi_i > 1$ on the wrong side of the hyperplane.
  • $\xi_1 = ... = \xi_n = 0$ is the maximal margin hyperplane optimisation.
  • slack variables $\xi_1,..., \xi_n$ allow individual observations to be on the wrong side of the margin or hyperplane.
  • test observations are classified as before, $f(x^*) = \beta_0 + \beta_1x^*_1 + ... + \beta_px^*_p$
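The slack variables above can be computed directly. A minimal sketch (with a hypothetical 1-D hyperplane and toy points, not from the lecture data) showing how $\xi_i = \max(0, 1 - y_i(\mathbf{x}_i \cdot \mathbf{w} + b))$ places each case on the correct side of the margin ($\xi_i = 0$), the wrong side of the margin ($0 < \xi_i \leq 1$), or the wrong side of the hyperplane ($\xi_i > 1$):

```python
import numpy as np

# Hypothetical 1-D hyperplane: w = [1], b = 0, so the margins sit at x = ±1.
w, b = np.array([1.0]), 0.0
X = np.array([[2.5], [0.5], [-0.5]])  # three points, all labelled +1
y = np.array([1, 1, 1])

# xi_i = max(0, 1 - y_i(x_i·w + b)): zero when the margin constraint holds.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)  # [0.  0.5 1.5]: correct side; inside the margin; wrong side of hyperplane
```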

Tuning Parameter (C)¶

To ensure there is a penalty, $C$, for relaxing the constraints, we can change our objective function to be minimised from $\frac{||\mathbf{w}||^2}{2}$ to,

$\frac{||\mathbf{w}||^2}{2}+C(\sum_i\xi_i)^k$.

If we set $k=1$, neither the $\xi_i$ nor their Lagrange multipliers appear in the Wolfe dual problem. This means we now have6:

$L_D \equiv \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_jy_iy_j\mathbf{x}_i\cdot \mathbf{x}_j \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i\alpha_iy_i = 0$.

This also has the same solution as before:

$\mathbf{w} = \sum^{N_S}_{i=1}\alpha_iy_i\mathbf{x}_i$.

  • the $\alpha_i$ now have an upper bound of $C$.

  • In scikit-learn, "the strength of the regularization is inversely proportional to C"

$C$ is a tuning parameter that controls the bias-variance trade-off.

With this formulation, when $C$ is large, margin violations are heavily penalised: the margin is narrow and rarely violated, and the classifier is highly fit to the training data (low bias, high variance). Conversely, when $C$ is smaller, the margin is wider, amounting to less hard fitting (high bias, low variance).

Like most hyper-parameters, it is often chosen using cross-validation.
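As a sketch of choosing $C$ by cross-validation (the synthetic data and grid values are illustrative assumptions, not from the lecture):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 5-fold cross-validation over a small, illustrative grid of C values.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```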

Like maximal margin classifiers, SVCs only rely on a few observations: those on the margin or those that violate it (the support vectors). Observations on the correct side of the margin don't change the classifier, which means SVCs are robust to observations far away from the hyperplane.

TODO

  • refs for above
  • [Insert a line about the bias-variance trade-off]

It is common to define $C$ as $C = \frac{1}{\nu N}$, where $0 < \nu \leq 1$ controls the fraction of misclassified points during the training phase7.
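In scikit-learn this parameterisation is available directly as `NuSVC`, where `nu` upper-bounds the fraction of margin errors (and lower-bounds the fraction of support vectors). A minimal sketch on synthetic data (the dataset and `nu` value are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=200, random_state=0)  # illustrative data

# nu must lie in (0, 1]; here at most 10% of training points may be margin errors.
clf = NuSVC(nu=0.1, kernel="linear").fit(X, y)
print(clf.support_vectors_.shape)
```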

1.7. Support Vector Machine (SVM) ¶

SVMs aim to address the situation where the boundary between two classes is not linear.

Feature Engineering¶

We could consider enlarging the feature space using quadratic, cubic, or higher-order polynomial functions. This would allow us to project our data, via a mapping function $\phi$, onto a higher-dimensional space where it is linearly separable (so a linear SVM model can be used in this new feature space). For example:

$\phi(x_1, x_2) = (z_1, z_2, z_3) = (x_1, x_2, x^2_1+x^2_2)$
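A quick numerical check of this mapping (on assumed toy data): points inside the unit circle get $z_3 < 1$ and points outside get $z_3 > 1$, so the plane $z_3 = 1$ separates classes that no line could separate in the original space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: label +1 inside the unit circle, -1 outside (not linearly separable).
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1, 1, -1)

# phi(x1, x2) = (x1, x2, x1^2 + x2^2): the third coordinate is the squared radius.
Z = np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])
print(((Z[:, 2] < 1) == (y == 1)).all())  # True: the plane z3 = 1 separates them
```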

Gaussian Radial Basis Function¶

We could instead use a Gaussian Radial Basis Function (RBF),

$\phi_\gamma(X,\ell) = \exp(-\gamma||X-\ell||^2)$.

This is a bell-shaped function which measures how much an instance resembles a landmark, varying from 0 (far from the landmark) to 1 (at the landmark).

In the example below we set our landmarks to $x_1 = -2$ and $x_1 = 1$.


[Figure: RBF similarity features at two landmarks; from Hands-On Machine Learning2]

TODO

  • Tidy the code below (I just copied and pasted bits because I'm lazy)

Using the example of $x_1=-1$ we can see it is a distance of 1 from the first landmark and 2 from the second.

If we set $\gamma = 0.3$ then our new features are:

$x_2 = \exp(-0.3 \times 1^2) \approx 0.74$

$x_3 = \exp(-0.3 \times 2^2) \approx 0.30$
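These two values can be reproduced directly from the RBF definition above:

```python
import numpy as np

def rbf_feature(x, landmark, gamma):
    """Gaussian RBF similarity between an instance and a landmark."""
    return np.exp(-gamma * (x - landmark) ** 2)

# x1 = -1 with landmarks at -2 and 1, and gamma = 0.3, as in the text.
x2 = rbf_feature(-1.0, -2.0, 0.3)
x3 = rbf_feature(-1.0, 1.0, 0.3)
print(round(x2, 2), round(x3, 2))  # 0.74 0.3
```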

In order to find the landmarks, the simplest approach is just to create a landmark at each instance in the dataset.

However, the larger the number of features, the higher the computational burden. Instead it is common to enlarge the feature space using an extension of the SVC termed a Support Vector Machine, which uses kernels.

Polynomial Feature Engineering: 1.57 ms ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Polynomial Kernel: 1.15 ms ± 6.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
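A comparison along these lines can be produced as follows (a sketch under assumed settings: `make_moons` data, a degree-3 polynomial, and `C=10`; absolute timings will vary by machine):

```python
import timeit

from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=0)

# Explicit polynomial feature engineering followed by a linear SVC ...
poly_features = make_pipeline(PolynomialFeatures(degree=3), StandardScaler(),
                              LinearSVC(C=10, max_iter=10000))
# ... versus an SVC that applies a degree-3 polynomial kernel implicitly.
poly_kernel = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, C=10))

for name, model in [("Polynomial Feature Engineering", poly_features),
                    ("Polynomial Kernel", poly_kernel)]:
    t = timeit.timeit(lambda: model.fit(X, y), number=20)
    print(f"{name}: {t / 20 * 1e3:.2f} ms per fit")
```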

Kernels¶

The kernel trick relies on the fact that we can define our SVM in terms of dot products, $\mathbf{x}_i\cdot \mathbf{x}_j$.

Imagine we had a mapping $\Phi$ which maps the data to some high dimensional Euclidean space

$\Phi: \mathbb{R}^d \mapsto H$,

then we would still be using dot products in $H$. However we could instead use a kernel function

$K(\mathbf{x}_i,\mathbf{x}_j) = \Phi(\mathbf{x}_i)\cdot \Phi(\mathbf{x}_j)$,

meaning we never need to compute $\Phi$ explicitly, saving us computation.

We still have all the same considerations, but replacing $\mathbf{x}_i\cdot \mathbf{x}_j$ with $K(\mathbf{x}_i, \mathbf{x}_j)$ allows us to produce an SVM in an infinite-dimensional space.

In the test phase, instead of computing the sign of $\mathbf{w}\cdot\mathbf{x}^* + b$ for a test point directly, we can use the support vectors, $\mathbf{s}_i$:

$f(\mathbf{x}^*) = \sum_{i = 1}^{N_S}\alpha_iy_i\Phi(\mathbf{s}_i)\cdot \Phi(\mathbf{x}^*) +b = \sum_{i = 1}^{N_S}\alpha_iy_iK(\mathbf{s}_i,\mathbf{x}^*) +b$,

avoiding computing $\Phi(\mathbf{x}^*)$.
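This identity is easy to verify with a fitted scikit-learn `SVC`, whose `dual_coef_` attribute stores the products $\alpha_i y_i$ for the support vectors (synthetic data assumed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)  # illustrative data
clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

# Rebuild f(x*) = sum_i alpha_i y_i K(s_i, x*) + b from the support vectors only.
x_star = X[:5]
sq_dists = ((clf.support_vectors_[:, None, :] - x_star[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)  # RBF kernel with the same gamma = 0.5
f_manual = (clf.dual_coef_ @ K + clf.intercept_).ravel()

print(np.allclose(f_manual, clf.decision_function(x_star)))  # True
```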


Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.

Notes

  • Roughly speaking, a kernel can be interpreted as a similarity function between pairs of samples4.
  • $\mathbf{w}$ lives in $H$ ("high dimensional")
  • $H$ is often referred to as a Hilbert space.
  • "it is clear that the above implicit mapping trick will work for any algorithm in which the data only appear as dot products (for example, the nearest neighbor algorithm). This fact has been used to derive a nonlinear version of principal component analysis by (Scholkopf, Smola and Muller, 1998b); it seems likely that this trick will continue to find uses elsewhere." Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.

Quadratic Kernel Example8

$\mathbf{x} = [q, p]$

Instead of transforming $(q_1,p_1)$ into $(q^2_1,\sqrt{2}q_1p_1,p^2_1)$ and $(q_2,p_2)$ into $(q^2_2,\sqrt{2}q_2p_2,p^2_2)$ and computing the dot product,

$$ (q^2_1,\sqrt{2}q_1p_1,p^2_1) \cdot (q^2_2,\sqrt{2}q_2p_2,p^2_2) = q^2_1q^2_2 + 2q_1q_2p_1p_2 + p^2_1p_2^2, $$

we could instead take the dot product in the original space,

$$ (q_1,p_1) \cdot (q_2,p_2) = q_1q_2+p_1p_2, $$

and then square it,

$$ (q_1q_2+p_1p_2)^2 = q^2_1q^2_2 + 2q_1q_2p_1p_2 + p^2_1p_2^2. $$

This is an example of the quadratic kernel $k(\mathbf{x}_i, \mathbf{x}_k) \stackrel{\text{def}}{=} (\mathbf{x}_i\cdot\mathbf{x}_k)^2$.
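A numerical check (with arbitrary example values) that squaring the original dot product matches the dot product of the explicitly transformed vectors:

```python
import numpy as np

def phi(q, p):
    """Explicit quadratic feature map: (q, p) -> (q^2, sqrt(2)qp, p^2)."""
    return np.array([q ** 2, np.sqrt(2) * q * p, p ** 2])

q1, p1, q2, p2 = 1.0, 2.0, 3.0, 4.0  # arbitrary example values

explicit = phi(q1, p1) @ phi(q2, p2)  # transform first, then take dot product
kernel = (q1 * q2 + p1 * p2) ** 2     # dot product first, then square
print(np.isclose(explicit, kernel))   # True: both give 121
```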

Polynomial kernels¶

[INSERT]

Notes

  • arguably, a polynomial is not a proper kernel (see REF)
  • If your model is overfitting, try reducing the polynomial degree; if it is underfitting, try increasing it.

Radial Basis Function kernel¶

The most widely used kernel is the RBF kernel (also known as a Gaussian kernel)4,8:

$$ k(\mathbf{x}, \mathbf{x}^\prime) = \exp\left(-\frac{||\mathbf{x}-\mathbf{x}^\prime||^2}{2\sigma^2}\right), $$

or simply

$$ \kappa(\mathbf{x},\mathbf{x}^\prime) = \exp\left(-\gamma||\mathbf{x}-\mathbf{x}^\prime||^2\right), \\ \gamma = \frac{1}{2\sigma^2}, $$

where $\gamma$ is a free parameter to be optimised and $||\mathbf{x}-\mathbf{x}^\prime||^2$ is the squared Euclidean distance between two feature vectors.

When classifying a test observation $x^* = (x^*_1, \ldots, x^*_p)$, only training observations close to $x^*$ (in terms of Euclidean distance) will play a role in its class label. This is because, for a training observation $x_i$ far from $x^*$, $\sum^p_{j=1}(x^*_j-x_{ij})^2$ will be large, so $\exp(-\gamma\sum^p_{j=1}(x^*_j-x_{ij})^2)$ will be smallREF.

TODO

  • reference last sentence above and fix notation

Notes

Euclidean distance8

$$ d(\mathbf{x}_i, \mathbf{x}_k) \stackrel{\text{def}}{=} \sqrt{\left(x_i^{(1)}-x_k^{(1)}\right)^2+\left(x_i^{(2)}-x_k^{(2)}\right)^2 + \ldots + \left(x_i^{(D)}-x_k^{(D)}\right)^2} = \sqrt{\sum_{j=1}^D\left(x_i^{(j)}-x_k^{(j)}\right)^2} $$

Increasing gamma ($\gamma$) makes the bell-shaped curve narrower (so each instance's range of influence is smaller), with the effect that the decision boundary becomes more irregular. A small $\gamma$ makes the bell-shaped curve wider: instances have a larger range of influence, and therefore a smoother decision boundary2.

$\gamma$ effectively acts like a regularization hyperparameter, so, like $C$, if your model is overfitting reduce it, and if it is underfitting increase itREF.

$C$ and $\gamma$ are tightly coupled. Generally, if you have a large $\gamma = 5$ (narrow bells) you'll need more regularization, so a small $C$ (so $\lambda = \frac{1}{C}$ is big). And if you have a smaller $\gamma = 1$ (wider bells), a larger value of $C$ should be used7.
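Because of this coupling, the two are usually tuned jointly. A sketch (illustrative data and grid values assumed):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # illustrative data

# C and gamma are searched together, since the best C depends on gamma.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 1, 5]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_)
```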


Other kernels exist but are less commonly used; the choice depends on the structure of the data.

[create table with examples starting with String kernels for text documents or DNA sequences]

see pg. 161 of hands on ML

Advantages¶

"It is this fact that allows us to construct hyperplanes in these very high dimensional spaces yet still be left with a tractable computation. Thus SVMs circumvent both forms of the “curse of dimensionality”: the proliferation of parameters causing intractable complexity, and the proliferation of parameters causing overfitting." Burges (1998)

Disadvantages¶

"Perhaps the biggest limitation of the support vector approach lies in choice of the kernel. Once the kernel is fixed, SVM classifiers have only one user-chosen parameter (the error penalty), but the kernel is a very big rug under which to sweep parameters. Some work has been done on limiting kernels using prior knowledge (Scholkopf et al., 1998a; Burges, 1998), but the best choice of kernel for a given problem is still a research issue." Burges (1998)

"A second limitation is speed and size, both in training and testing. While the speed problem in test phase is largely solved in (Burges, 1996), this still requires two training passes. Training for very large datasets (millions of support vectors) is an unsolved problem" Burges (1998)

"Discrete data presents another problem, although with suitable rescaling excellent results have nevertheless been obtained (Joachims, 1997)." Burges (1998)

Associated Exercises¶

Now might be a good time to try exercises X-X.

References¶

  1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112, p. 18). New York: Springer.
  2. Géron, A. (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, Inc.
  3. Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc.
  4. Raschka, 2016
  5. Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20:273–297, 1995
  6. Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.
  7. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
  8. Burkov, A. (2019). The hundred-page machine learning book (Vol. 1). Canada: Andriy Burkov.
